For this project you will be doing the Bike Sharing Demand Kaggle challenge!. The main point of this project is to get you feeling comfortabe with Exploratory Data Analysis and begin to get an understanding that sometimes certain models are not a good choice for a data set. In this case, we will discover that Linear Regression may not be the best choice given our data!
Just complete the tasks outlined below.
You can download the data or just use the supplied csv in the repository. The data has the following features:
head(bike)
str(bike$count)
## int [1:10886] 16 40 32 13 1 1 2 3 8 14 ...
library(ggplot2)
ggplot(data = bike, aes(x = temp, y = count)) + geom_point(aes(color = temp, alpha = 0.5))
ggplot(data = bike, aes(x = as.POSIXct(datetime), y = count)) + geom_point(aes(alpha = 0.5, color = temp)) + scale_color_gradient(low = "darkgreen", high = "orange")
cor(bike$temp, bike$count)
## [1] 0.3944536
ggplot(data = bike, aes(y = count, x = factor(season))) + geom_boxplot(aes(color = factor(season)))
A line can’t capture a non-linear relationship. There are more rentals in winter than in spring
A lot of times you’ll need to use domain knowledge and experience to engineer and create new features. Let’s go ahead and engineer some new features from the datetime column.
time.stamp <- bike$datetime format(time.stamp, “%H”)
bike$hour <- strftime(bike$datetime, format="%H")
head(bike)
Use the additional layer: scale_color_gradientn(colors=c(‘color1’,color2,etc..)) where the colors argument is a vector gradient of colors you choose, not just high and low. Use position=position_jitter(w=1, h=0) inside of geom_point() and check out what it does.
ggplot(data = subset(x = bike, subset = workingday == 1), aes(x = hour, y = count)) + geom_point(position=position_jitter(w=1, h=0),aes(color = temp), alpha =0.5) + scale_color_gradient2(low = "blue", mid = "green", high = "red")
#### Now create the same plot for non working days:
ggplot(data = subset(x = bike, subset = workingday == 0), aes(x = hour, y = count)) + geom_point(position=position_jitter(w=1, h=0),aes(color = temp), alpha =0.5) + scale_color_gradient2(low = "blue", mid = "green", high = "red")
#### You should have noticed that working days have peak activity during the morning (~8am) and right after work gets out (~5pm), with some lunchtime activity. While the non-work days have a steady rise and fall for the afternoon
temp.model <- lm(count ~ temp, data = bike)
summary(temp.model)
##
## Call:
## lm(formula = count ~ temp, data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -293.32 -112.36 -33.36 78.98 741.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0462 4.4394 1.362 0.173
## temp 9.1705 0.2048 44.783 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 166.5 on 10884 degrees of freedom
## Multiple R-squared: 0.1556, Adjusted R-squared: 0.1555
## F-statistic: 2006 on 1 and 10884 DF, p-value: < 2.2e-16
6.0462 + 9.1705*25
## [1] 235.3087
predict(temp.model, data.frame(temp = c(25)))
## 1
## 235.3097
bike$hour <- sapply(bike$hour, as.numeric )
season holiday workingday weather temp humidity windspeed hour (factor)
lm.model <- lm(count ~ . -casual - registered -datetime -atemp,bike)
summary(lm.model)
##
## Call:
## lm(formula = count ~ . - casual - registered - datetime - atemp,
## data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -324.61 -96.88 -31.01 55.27 688.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.91369 8.45147 5.551 2.91e-08 ***
## season 21.70333 1.35409 16.028 < 2e-16 ***
## holiday -10.29914 8.79069 -1.172 0.241
## workingday -0.71781 3.14463 -0.228 0.819
## weather -3.20909 2.49731 -1.285 0.199
## temp 7.01953 0.19135 36.684 < 2e-16 ***
## humidity -2.21174 0.09083 -24.349 < 2e-16 ***
## windspeed 0.20271 0.18639 1.088 0.277
## hour 7.61283 0.21688 35.102 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147.8 on 10877 degrees of freedom
## Multiple R-squared: 0.3344, Adjusted R-squared: 0.3339
## F-statistic: 683 on 8 and 10877 DF, p-value: < 2.2e-16